Abstract

Phylogenetic relationships of the primarily wingless insects are still considered unresolved. Even the most comprehensive 
phylogenomic studies that addressed this question did not yield congruent results. To get a grip on these problems, we 
here analyzed the sources of incongruence in these phylogenomic studies by using an extended transcriptome data set. 
Our analyses showed that unevenly distributed missing data can be severely misleading by inflating node support despite 
the absence of phylogenetic signal. In consequence, only decisive data sets should be used which exclusively comprise 
data blocks containing all taxa whose relationships are addressed. Additionally, we used Four-cluster Likelihood Mapping 
(FcLM) to measure the degree of congruence among genes of a data set, as a measure of support alternative to bootstrap. 
FcLM showed incongruent signal among genes, which in our case is correlated neither with functional class assignment of 
these genes nor with model misspecification due to unpartitioned analyses. The data set analyzed here is currently the
largest data set covering primarily wingless insects, but it failed to elucidate their interordinal phylogenetic relationships.
Although this is unsatisfying from a phylogenetic perspective, we try to show that the analyses of structure and signal 
within phylogenomic data can protect us from biased phylogenetic inferences due to analytical artifacts. 

Key words: phylogenomics, ESTs, likelihood quartet mapping, conflicting hypotheses, Entognatha, Nonoculata, Ellipura, 
Protura, Diplura, Collembola, missing data. 



Introduction 

Despite enormous efforts to resolve the tree of life, several 
deep nodes are still considered unresolved. A good example
of such problems is the unresolved phylogenetic relationships
of primarily wingless insects.

Most phylogenetic studies including multigene and 
phylogenomic analyses have recovered the monophyly of 
Hexapoda, the insect clade in a broad taxonomic sense 
(Regier et al. 2008, 2010; von Reumont et al. 2009, 2012; 
Meusemann et al. 2010; Trautwein et al. 2012). 
Furthermore, the monophyly of Ectognatha, which comprises
insects in a strict taxonomic sense, namely jumping bristletails,
silverfishes and firebrats, and winged insects, is well
supported (reviewed in Grimaldi 2010; Trautwein et al. 
2012). By contrast, phylogenetic relationships among the 
entognathous primarily wingless insects, the Protura (cone- 
heads), Collembola (springtails), and Diplura (two-pronged 
bristletails), are unclear. Many authors consider these 
entognathous insects as being monophyletic, considering 
entognathy in which mouth parts are concealed in gnathal 
pouches (first discussed in detail by Hennig 1953) to have 
evolved in the last common ancestor of the three groups. 
Within Entognatha, either a clade uniting Protura and
Collembola, referred to as Ellipura (Börner 1910), or a clade
uniting Protura and Diplura, referred to as Nonoculata (Luan
et al. 2005), has been proposed (Ellipura [Hennig 1953;
Kristensen 1981, 1997; Shao et al. 1999; Bitsch and Bitsch
2000, 2004; Carapelli et al. 2000; Zhang et al. 2001];
Nonoculata [Giribet and Wheeler 2001; Giribet et al. 2004;
Luan et al. 2005; Kjer et al. 2006; Mallatt and Giribet 2006;
Misof et al. 2007; Dell'Ampio et al. 2009; von Reumont et al.
2009; Mallatt et al. 2010]). Other authors consider a paraphyly
of Entognatha to be more likely, with Diplura as closest rela- 
tives to Ectognatha. Possible arguments for this hypothesis 
include the evolutionary origin of paired pretarsal claws and 
paired cerci (Kukalova-Peck 1987; Koch 1997; Beutel and 
Gorb 2006), the ultrastructure of the sperm (Dallai et al. 
2011), and the differentiation process of the embryonic 
amnion (Machida 2006) in the last common ancestor of 
Diplura and Ectognatha. 

Meusemann et al. (2010) and von Reumont et al. (2012) 
published the most relevant data sets and analyses covering 
the phylogenetic relationships among primarily wingless in- 
sects by including expressed sequence tag (EST) data of rep- 
resentatives of Protura, Collembola, and Diplura. Although 
both studies recovered the monophyly of Entognatha, 
Meusemann et al. found strong evidence for Protura and 
Diplura as closest relatives (i.e., Nonoculata) and von 
Reumont et al. for Protura and Collembola as closest relatives 
(i.e., Ellipura). These incongruent results are puzzling because 
taxon sampling of the primarily wingless insects is comparable 
in both studies, as well as the strategies used for orthology 
assignment, alignment masking, matrix optimization, and 
tree inference. 

These special circumstances put us into the exceptionally 
favorable position to analyze possible sources of incongruence 
among these two large phylogenomic data sets. Most phylo- 
genomic studies are based on concatenated supermatrices 
with low gene data coverage. Focusing on relationships 
among specific groups, many data blocks within such super- 
matrices therefore may not contain data for all taxa under 
consideration. Consequently, our starting hypothesis was that 
extensive missing data may mislead proper tree reconstruc- 
tion. To tackle this problem, we complement the publicly 
available EST data of primarily wingless insects with additional 
EST data from representatives of Japygidae (Diplura) and 
Zygentoma (silverfishes and firebrats). We took particular 
care to concatenate a data set that contains only gene data 
blocks for which entognathous hexapods and outgroups had 
gene data coverage. We call such a data set in the following a 
decisive data set. Note that the term decisiveness has been 
used before in the context of phylogenomic data sets (Steel 
and Sanderson 2010; Sanderson et al. 2010), albeit based on a 
distinct criterion. The concatenated data set is the largest 
known data set covering primarily wingless insects. It was 
this data set that allowed us to analyze the effect of the 
observed uneven distribution of missing data on the extent 
of bootstrap support (BS). Complementary to the application 
of BS measures, we applied a Four-cluster Likelihood Mapping 
(FcLM) approach (Strimmer and von Haeseler 1997), which
has been shown to be effective in disentangling signal among
four groups of species. The application of bootstrapping and 
FcLM helped to assess the effect of the uneven distribution of 
missing data in indecisive data sets. Complementary to the 
previously mentioned analyses, we addressed the problem of 
incongruent signal among genes in a multigene data set by 
comparing tree reconstructions based on the entire decisive 
data set with tree reconstructions based on subsets of genes 
that support incongruent hypotheses. Altogether, our 
approach provides potential explanations for contradictory 
results among phylogenomic studies by pointing out under- 
estimated sources of error and incongruence. 

Results 

Orthology Assignment, Alignment, and Alignment 
Masking 

Using the reference set of 1,886 1:1 orthologous genes (OGs), 
we identified between 52 and 682 putative 1:1 orthologous 
transcripts in the transcriptome assemblies of primarily wing- 
less hexapods (table 1) and up to 1,886 for all taxa (supple- 
mentary table S1, Supplementary Material online). We 
excluded 20 OGs that were present in the five reference spe- 
cies but absent from all other species from subsequent anal- 
yses. After alignment masking (i.e., the exclusion of multiple 
sequence alignment sections in which sequence similarity 
cannot be distinguished from random similarity of 
sequences), the concatenated superalignment was composed 
of 73 taxa with a total alignment length of 881,235 amino 
acid sites, partitioned into 1,866 genes (supplementary fig. S1; 
for gene annotations, see supplementary table S2, 
Supplementary Material online). 

Relationships among Entognathous Hexapod Lineages 
The data set M_Ento, which is decisive for addressing rela- 
tionships among the three entognathous groups, Protura, 
Collembola, and Diplura (73 taxa, 117 genes, 32,883 aligned 
aa sites), moderately supported a clade Protura + Diplura 
(Nonoculata) (fig. 1). This is compatible with the results of 
the FcLM approach (topology T1 favored, fig. 2). Tree reconstruction
supported Collembola as closest relatives to a
clade comprising Nonoculata and Ectognatha. The clade 
Nonoculata + Ectognatha received moderate support 
(fig. 1; supplementary fig. S2, Supplementary Material online). 

Our tree reconstructions based on a selected optimal 
subset (SOS) extracted from a complete data matrix by op- 
timizing information content and data saturation in iterative 
steps of gene and/or taxon exclusion (see MARE manual; 
Meusemann et al. 2010; Meyer and Misof 2010) (62 taxa, 
253 genes, alignment length 55,429 aa positions) yielded 
monophyletic Entognatha with moderate support and 
Nonoculata with low support (fig. 3a and table 2; supplemen- 
tary fig. S3a, Supplementary Material online). It should be 
kept in mind that this SOS is indecisive for addressing the 
relationships of Entognatha, with only one-third of all genes 
(79) of this data set being covered by all three entognathous 
groups (supplementary table S3, Supplementary Material 
online). The tree based on the data set SOS^, in which 






Decisive Data Sets in Phylogenomics • doi:10.1093/molbev/mst196 



MBE 



Table 1. Primarily Wingless Hexapod Species Included in This Study, and Their Number of OGs in the Original Supermatrix and in Three Data Subsets.

| Order | Family | Species | Source | No. of Contigs | Total no. of OGs | No. of OGs in M_Ento | No. of OGs in SOS | No. of OGs in SOS^ |
|---|---|---|---|---|---|---|---|---|
| Protura | Acerentomidae | Acerentomon sp. [a] | NCBI [a] | 1,999 | 191 | 117 | 91 | 12 |
| Diplura | Campodeidae | Campodea fragilis | NCBI | 6,407 | 370 | 77 | 116 | 64 |
| Diplura | Japygidae | Megajapyx sp. | this study | 57,602 | 547 | 105 | 164 | 89 |
| Collembola | Neanuridae | Anurida maritima | NCBI | 3,504 | 328 | 55 | 105 | 60 |
| Collembola | Onychiuridae | Onychiurus arcticus | NCBI | 9,981 | 795 | 103 | 183 | 114 |
| Collembola | Isotomidae | Cryptopygus antarcticus | NCBI | 1,897 | 199 | 49 | 78 | 35 |
| Collembola | Isotomidae | Folsomia candida | NCBI | 5,967 | 442 | 60 | 122 | 78 |
| Collembola | Entomobryidae | Orchesella cincta | NCBI | 754 | 52 | 10 | - | - |
| Archaeognatha | Machilidae | Lepismachilis y-signata | NCBI | 2,288 | 270 | 60 | 107 | 54 |
| Zygentoma | Lepismatidae | Tricholepisma aurea | NCBI | 344 | 54 | 22 | - | - |
| Zygentoma | Lepismatidae | Thermobia domestica | this study | 45,358 | 682 | 96 | 194 | 124 |

Note: M_Ento is the decisive data set in which all OGs are covered by Protura, Diplura, and Collembola; SOS and SOS^ are indecisive to address the relationships of
entognathous hexapod orders.

[a] Acerentomon sp.: erroneously assigned as A. franzi in Meusemann et al. (2010) and NCBI.
Fig. 1. Simplified phylogenetic tree of the decisive data set M_Ento. Best ML tree (RAxML v.7.2.8, PROTCAT, LG + GAMMA), based on 117 OGs that
are covered by Protura, Diplura, and Collembola. BS is derived from 1,000 bootstrap replicates. Rogue taxa (supplementary material [section 4], 
Supplementary Material online) were pruned prior to tree inference. The tree was rooted with Capitella sp. For the full tree, see supplementary figure S2, 
Supplementary Material online. 



these 79 genes were removed to artificially create a maximally 
indecisive data set, showed Entognatha with strong support 
(table 2) and additionally, diplurans were paraphyletic 
with respect to Protura (fig. 3b; supplementary fig. S3b, 
Supplementary Material online). Both SOS data sets (11
taxa from the supermatrix, including the collembolan
Orchesella cincta, were removed in the optimization process)
did not contain any rogue taxa, that is, taxa that assume 
incongruent phylogenetic positions in a set of bootstrap 
trees (Aberer and Stamatakis 2011) (supplementary material 



[section 3], Supplementary Material online). Tree reconstruc- 
tions of all data sets strongly supported monophyletic 
Ectognatha and monophyletic Hexapoda (table 2). 

Incongruent Signal among Genes 
Based on the M_Ento data set, the FcLM approach helped to
identify a predominant signal for topology T1
(Protura + Diplura) - (Collembola + remaining taxa) in 51 genes
(12,548 aligned aa positions) (data set M_Nono, derived
from Nonoculata), a predominant signal for topology T2
(Protura + Collembola) - (Diplura + remaining taxa) in 35
genes (11,789 aligned aa positions) (data set M_Elli, derived
from Ellipura), and a predominant signal for topology T3
(Diplura + Collembola) - (Protura + remaining taxa) in 31
genes (8,546 aligned aa positions) (data set M_DiCo) (fig. 4a
and b). Tree inferences from data sets M_Nono, M_Elli, and 
M_DiCo (rogue taxa pruned, see Materials and Methods sec- 
tion) yielded maximal BS support for Nonoculata, Ellipura, 
and Diplura + Collembola, respectively (table 2; supplemen- 
tary fig. S4, Supplementary Material online). However, 
although tree reconstruction of our data subsets M_Nono, 
M_Elli, and M_DiCo showed maximal BS support for incon- 
gruent topologies among the entognathous insect orders, the 
results from the FcLM approach indicated that signal for 
alternative topologies was present in all data sets (fig. 4a 
and b; supplementary table S4, Supplementary Material 




Fig. 2. Results of the FcLM for all OGs in data set M_Ento. The chart
shows the proportion of quartets (summed up for 117 OGs) that show
predominant support for T1 ([Protura + Diplura] - [Collembola +
remaining taxa], Nonoculata hypothesis, blue), T2 ([Protura +
Collembola] - [Diplura + remaining taxa], Ellipura hypothesis, red),
and T3 ([Diplura + Collembola] - [Protura + remaining taxa],
yellow); see fig. 5. Quartets mapping in remaining Voronoi cells (gray)
and T* (fig. 5) were not considered.



online), which is not reflected by the trees. To identify possi- 
ble reasons for incongruent signal among genes, we assessed 
the correlation between functional classes of genes and the 
different phylogenetic hypotheses that are supported by the 
data subsets. We found no correlation (supplementary ma- 
terial [section 4], table S5 and fig. S5, Supplementary Material 
online). Additionally, we tested whether model misspecification
can explain the observed incongruence among genes and
analyzed the data set M_Ento and data subsets M_Nono,
M_Elli, and M_DiCo using partitioned phylogenetic analyses
(Minh et al. 2013) with the best model selected for each gene 
(partition) separately (supplementary material [section 5], 
table S6, and figs. S6-S9, Supplementary Material online). 
With respect to the phylogenetic relationships addressed 
in our study, resulting topologies did not differ from 
unpartitioned analyses, and BS only differed to a minor
degree (table 2). 
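
The gene-wise FcLM classification used in this section can be illustrated with a minimal Python sketch. It assumes that the log-likelihoods of the three possible quartet topologies (T1, T2, T3) are already available for each quartet. The function names, the toy log-likelihood values, and the nearest-attractor rule used to approximate the seven regions of the likelihood-mapping triangle are illustrative assumptions, not the implementation used in this study.

```python
import math
from collections import Counter

# Attractor points of the seven regions of the likelihood-mapping simplex:
# three corners (one topology favored), three edge midpoints (two topologies
# equally favored), and the center (star-like, no resolution).
ATTRACTORS = {
    "T1": (1.0, 0.0, 0.0),
    "T2": (0.0, 1.0, 0.0),
    "T3": (0.0, 0.0, 1.0),
    "T1/T2": (0.5, 0.5, 0.0),
    "T1/T3": (0.5, 0.0, 0.5),
    "T2/T3": (0.0, 0.5, 0.5),
    "center": (1 / 3, 1 / 3, 1 / 3),
}

def posterior_weights(log_likelihoods):
    """Turn three log-likelihoods into weights that sum to 1 (numerically stable)."""
    best = max(log_likelihoods)
    scaled = [math.exp(value - best) for value in log_likelihoods]
    total = sum(scaled)
    return tuple(value / total for value in scaled)

def voronoi_region(weights):
    """Assign a point in the simplex to the region of its nearest attractor."""
    return min(ATTRACTORS, key=lambda name: math.dist(weights, ATTRACTORS[name]))

def predominant_topology(quartet_log_likelihoods):
    """Count corner-region quartets per topology and report the most frequent one.

    Quartets falling into edge or center regions are ignored (assumption:
    this mirrors the text's 'ambiguous support is not considered').
    """
    regions = Counter(
        voronoi_region(posterior_weights(q)) for q in quartet_log_likelihoods
    )
    corners = {name: regions.get(name, 0) for name in ("T1", "T2", "T3")}
    best = max(corners, key=corners.get)
    return (best if corners[best] > 0 else None), corners

# Toy log-likelihoods (T1, T2, T3) for four quartets of one hypothetical gene.
quartets = [
    (-1000.0, -1010.0, -1012.0),  # clearly T1
    (-500.0, -500.1, -520.0),     # T1 and T2 nearly tied: edge region
    (-800.0, -790.0, -805.0),     # clearly T2
    (-300.0, -309.0, -311.0),     # clearly T1
]
print(predominant_topology(quartets))
```

A gene would then be binned into M_Nono, M_Elli, or M_DiCo according to whether T1, T2, or T3 is returned; how ties are broken is left open in this sketch.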

Discussion 

The Importance of Data Set Decisiveness 
Incongruences in proposed relationships among Protura, 
Collembola, and Diplura in the studies of Meusemann et al. 
(2010) and von Reumont et al. (2012), which both supported 
monophyly of Entognatha, motivated us to look for new 
approaches to uncover and analyze possible sources of incon- 
gruent signal in phylogenomic data sets. 

Both SOS data sets in Meusemann et al. (2010) and von 
Reumont et al. (2012) were compiled with MARE (Meyer and 
Misof 2010) and were intended to address pancrustacean 
and arthropod relationships. Both data sets showed only 
low decisiveness for addressing the relationships of the 
three entognathous lineages: only 28 out of 128 genes in 
Meusemann et al. (2010) and 22 out of 316 genes in von 
Reumont et al. (2012) contained representatives of Protura, 
Diplura, and Collembola. 
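
The degree of decisiveness quoted here can be expressed as the fraction of genes that contain all three entognathous groups. The short Python sketch below recomputes this fraction from the gene counts given in the text; `decisive_fraction` is a hypothetical helper introduced only for illustration.

```python
def decisive_fraction(n_decisive_genes: int, n_total_genes: int) -> float:
    """Share of genes in a data set that cover all focal groups."""
    if n_total_genes <= 0:
        raise ValueError("n_total_genes must be positive")
    return n_decisive_genes / n_total_genes

# (genes covered by Protura, Diplura, and Collembola; total genes), as reported in the text.
datasets = {
    "Meusemann et al. (2010)": (28, 128),
    "von Reumont et al. (2012)": (22, 316),
    "SOS (this study)": (79, 253),
}

for name, (decisive, total) in datasets.items():
    share = decisive_fraction(decisive, total)
    print(f"{name}: {decisive}/{total} genes decisive ({share:.1%})")
```

Under this measure, fewer than one-third of the genes in any of the three SOS data sets carry information on all three lineages at once.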

Despite low gene data coverage in both studies, the mono- 
phyly of Entognatha received high BS. By contrast, our 
decisive data set for addressing the relationships among 
these three insect orders lacks clear support for Entognatha 
Fig. 3. Simplified phylogenetic trees of data sets SOS (a) and SOS^ (b). Best ML tree (RAxML v.7.2.8, PROTCAT, LG + GAMMA) (a) based on 253 OGs,
79 of which are covered by Protura, Diplura, and Collembola (SOS) and (b) based on 174 OGs, none of which are covered by Protura, Diplura, and 
Collembola (SOS^). BS is derived from 1,000 bootstrap replicates. Trees were rooted with Capitella sp. For the full trees, see supplementary figure S3a 
and S3b, Supplementary Material online. 






Table 2. BS (%) for Selected Clades in Tree Reconstructions with Various Data Sets.

| Clade | M_Ento | SOS | SOS^ | Meusemann et al. (2010) | von Reumont et al. (2012) | M_Nono (subset of M_Ento) | M_Elli (subset of M_Ento) | M_DiCo (subset of M_Ento) |
|---|---|---|---|---|---|---|---|---|
| Hexapoda | 100 (100) | 100 | 98 | 100 | 99 | 72 (100) | 100 (100) | 100 (100) |
| Diplura | 100 (100) | 100 | [a] | N.A. | N.A. | 100 (100) | 100 (100) | 100 (100) |
| Collembola | 100 (100) | 100 | 100 | 100 | 100 | 100 (100) | 100 (100) | 100 (100) |
| (Protura, Diplura) [b] | 91 (96) | 51 | 83 [a] | 100 | - | 100 (100) | - (-) | - (-) |
| (Protura, Collembola) [c] | - (-) | - | - | - | 98 | - (-) | 100 (100) | - (-) |
| (Diplura, Collembola) | - (-) | - | - | - | - | - (-) | - (-) | 99 (100) |
| Entognatha | - (-) | 81 | 94 | 86 | 98 | - (-) | - (-) | - (-) |
| ((Protura, Diplura), Ectognatha) | 80 (96) | - | - | - | - | 98 (100) | - (-) | - (-) |
| ((Collembola, Diplura), Ectognatha) | - (-) | - | - | - | - | - (-) | - (-) | 60 (83) |
| (Diplura, Ectognatha) | - (-) | - | - | - | - | - (-) | 66 (100) | - (-) |
| Ectognatha | 100 (100) | 100 | 99 | 100 | 100 | 100 (100) | 100 (100) | 95 (84) |

Note: BS was assessed with RAxML from 1,000 bootstrap replicates (see Materials and Methods). BS printed in brackets was assessed from partitioned ML analyses of data set
M_Ento and its subsets using the UFBoot algorithm of IQ-TREE with 5,000 bootstrap replicates (supplementary material [section 5], Supplementary Material online). M_Ento is
the decisive data set in which all OGs are covered by Protura, Diplura, and Collembola; SOS, SOS^, and the data sets from Meusemann et al. (2010; data set SOS, ML tree) and
von Reumont et al. (2012; data set SOS, ML tree Set 1 red) are indecisive to address the relationships of entognathous hexapod orders. M_Nono, M_Elli, and M_DiCo are subsets of
M_Ento with predominant signal for different topologies and point out conflict of signal among genes.
[a] Diplurans are paraphyletic: Campodea + (Acerentomon, Megajapyx).
[b] Nonoculata hypothesis.
[c] Ellipura hypothesis.
Fig. 4. Detailed results of the FcLM for all OGs included in data set M_Ento and data subsets M_Nono, M_Elli, and M_DiCo. (a) Histogram of FcLM
results. Each bar refers to an OG (for OG-IDs, see supplementary table S2, Supplementary Material online). Y axis: proportion of quartets (in %) that
predominantly support T1 ([Protura + Diplura] - [Collembola + remaining taxa], blue), T2 ([Protura + Collembola] - [Diplura + remaining taxa],
red), and T3 ([Diplura + Collembola] - [Protura + remaining taxa], yellow); quartets that show ambiguous support are not considered (fig. 5). OGs
with predominant support for T1 are classified into data set M_Nono (51 genes, 12,548 aligned aa positions); OGs with predominant support for T2 are
classified into data set M_Elli (35 genes, 11,789 aligned aa positions); OGs with predominant support for T3 are classified into data set M_DiCo (31
genes, 8,546 aligned aa positions). (b) FcLM results for data set M_Nono (left), M_Elli (middle), and M_DiCo (right). Each chart shows the proportion of
quartets (summed up for the OGs included in the data sets) that show predominant support for T1, T2, and T3 (see above and fig. 5). Quartets that
show ambiguous support (fig. 5) are not considered.
(fig. 1). This puzzling result might be explained by the
presence of an uneven distribution of missing data. We 
gained indirect evidence for this hypothesis with the analyses 
of the worst case data set SOS^. This data set is maximally 
indecisive for testing the monophyly of Entognatha, that is, 
none of the included genes were common to all three entog- 
nathous insect groups. Any inferred support for this clade in 
the SOS^ analysis can be considered an artifact. Remarkably, 
bootstrapping delivered high, clearly artificial support for 
monophyletic Entognatha in the SOS^ tree (fig. 3b). 

We conclude from this indirect evidence that the support 
for Entognatha in Meusemann et al. (2010), von Reumont 
et al. (2012) and in our data set indecisive concerning this 
question (fig. 3a) probably results from an artificial signal due 
to uneven distribution of missing data (Philippe et al. 2011) 
among Protura, Diplura, and Collembola. 

Based on the analyses of the decisive and indecisive data 
sets, we reject the hypothesis that missing data are unproblematic
as long as many characters have been sampled overall
(Wiens 2006). Missing data can be misleading as shown by the 
worst case SOS^ data set analysis, in which relationships 
received high BS although the data set was maximally inde- 
cisive. Therefore, we strongly advocate the exclusive use of 
decisive data sets in phylogenomic studies. 

Incongruent Signal between Genes in a Multigene 
Data Set 

Even decisive data sets can contain incongruent signal 
(Degnan and Rosenberg 2009; Knowles 2009; Philippe et al. 
2011). Using FcLM, we identified groups of genes that support
different relationships of Protura, Collembola, and Diplura in 
the decisive data set M_Ento (fig. 4a and b). Additionally, we 
assessed conflict within the data with split analyses relying on 
NeighborNetworks (supplementary material [section 6] and 
figs. S10-S13, Supplementary Material online). This analysis 
corroborates the FcLM result that all analyzed data sets
contain incongruent signal. In addition to the problem of
indecisiveness discussed earlier, this incongruent signal 
among genes may partly be responsible for the contradictory 
results of Meusemann et al. (2010) and von Reumont et al. 
(2012). However, incongruent signal among genes is difficult 
to address and rectify. We analyzed two potential sources of 
conflict and can conclude that both can be excluded. First, we 
tested for homoplasy due to analogous selection regimes in 
functional complexes but found no correlation between pre- 
dicted gene function and phylogenetic signal (supplementary 
material [section 4], fig. S5, and table S5, Supplementary 
Material online). Second, we were able to indirectly exclude 
model misspecifications as sources of incongruent signal be- 
cause unpartitioned and partitioned maximum likelihood 
(ML) analyses yielded topologically congruent results and
almost identical BS (table 2; supplementary material [section 
5], table S6, and figs. S6-S9, Supplementary Material online). 
With respect to the FcLM, it may well be that this likelihood 
mapping approach selects sets of genes with congruent sub- 
stitution processes. A possible solution, but certainly not a 
fully satisfying one, would be to increase the number of genes 
to minimize noise and confounding signal. 



Relationships of Protura, Collembola, and Diplura 

Monophyly of Entognatha 

The monophyly of Entognatha has never been maximally 
supported and this has not changed in our analyses 
(table 2). Studies encompassing representatives of Protura, 
Collembola, and Diplura are limited to only a few analyses 
(Colgan et al. 1998; Carapelli et al. 2000; Edgecombe et al. 
2000; Giribet et al. 2001, 2005). Monophyletic Entognatha 
were recovered in all recent studies based on nuclear rRNA 
genes (Gao et al. 2008; Dell'Ampio et al. 2009; von Reumont 
et al. 2009; Mallatt et al. 2010). However, BS was low, which 
was either explained by character choice (Dell'Ampio et al. 
2009) or the influence of nonstationary processes across taxa 
(von Reumont et al. 2009). From the morphological point 
of view, most apomorphies suggesting the monophyly of 
Entognatha represent reductions (malpighian papillae vs. 
tubules; reduction to loss of compound eyes). The only ex- 
ception is the evolution of mouthparts that are concealed in 
gnathal pouches (Beutel and Gorb 2006). Diplura as closest 
relatives to Ectognatha is the only relation that contradicts 
monophyletic Entognatha, and for which morphological 
evidence has been published (Kukalova-Peck 1991; Koch 
1997; Beutel and Gorb 2006; Dallai et al. 2011). In general, 
morphological support for any clade encompassing more 
than one of the entognathous lineages Protura, Diplura, 
and Collembola is weak, largely because character polarization 
is problematic. This is due to the lack of applicability of char- 
acters and/or missing comparative studies in the crustacean 
groups that are discussed to be most closely related to 
Hexapoda (Szucsich and Pass 2008). 

Ellipura versus Nonoculata 

Molecular analyses mostly support Nonoculata (Protura + 
Diplura) (Giribet et al. 2004; Luan et al. 2005; Kjer et al. 2006; 
Mallatt and Giribet 2006; Misof et al. 2007; Dell'Ampio et al. 
2009; von Reumont et al. 2009; Mallatt et al. 2010; see 
Dell'Ampio et al. 2011 for a review), whereas most morphologists
merge Protura and Collembola into Ellipura (Börner 1910;
Hennig 1953; Kristensen 1981, 1997; Kukalova-Peck 1987; 
Bitsch and Bitsch 2000, 2004; Beutel and Gorb 2006). 
Molecular evidence for Ellipura is weak and limited to three 
mitochondrial single-gene analyses (Shao et al. 1999; Carapelli 
et al. 2000; Zhang et al. 2001), and morphological support for 
Nonoculata is nearly missing (Szucsich and Pass 2008). These 
controversies call for phylogenomic approaches. 

The majority of the 117 genes that compose the decisive 
data set M_Ento contain predominant signal for Nonoculata 
(fig. 4a). Also, the FcLM analysis of M_Ento (fig. 2) and the 
phylogenetic tree of M_Ento (fig. 1) yielded monophyletic 
Nonoculata, albeit not being well supported. In summary, 
Nonoculata is slightly favored over Ellipura in our study, but 
the question of the phylogenetic relationships of the three 
entognathous hexapod orders remains unsettled. 

Conclusions 

Clades may be incorrect, even if receiving high BS support 
(e.g., monophyly of Entognatha in Meusemann et al. [2010], 
von Reumont et al. [2012], and in data sets SOS and SOS^ of 
this study). This is a trivial conclusion and different reasons
are mentioned in the literature (Lehtonen 2011, Simmons 
and Freudenstein 2011). We show that an uneven distribu- 
tion of missing data (i.e., the use of indecisive data sets) can 
lead to strongly supported, yet incorrect, clades. To avoid 
misleading phylogenetic conclusions from seemingly robust 
trees based on phylogenomic data sets, we advise 1) using 
only data sets that are decisive for the phylogenetic question 
of interest, 2) including an alternative measure of support 
(Salichos and Rokas 2013); our method of choice was the 
FcLM approach, and 3) analyzing and documenting the in- 
ferred incongruence of signal between genes. 

In our decisive data set, we found strong incongruence 
among genes that is neither correlated with functional 
classes of genes nor with model misspecifications in unpartitioned
analyses. Based upon these notes of caution, we found
no signal for the monophyly of Entognatha, and we found no 
strong signal for Ellipura or Nonoculata despite extending our 
data set with additional data from key taxa. In other words, 
the phylogeny and evolution of early hexapods remains enig- 
matic. Despite this, we show that there are valuable lessons 
to be learned from the analyses of phylogenomic data of 
primarily wingless insects, particularly in terms of incongru- 
ence among genes and data decisiveness. 

Materials and Methods 

Taxon Sampling and New Transcriptome Data 
Our taxon sampling included 73 species: 46 hexapods, and, 
as outgroup species, 25 crustaceans, the chelicerate Ixodes 
scapularis, and the polychaete worm Capitella sp., both pre- 
sent in the reference set of taxa used for orthology assignment 
(discussed later). Transcriptome assemblies of 71 species were 
obtained from the Deep Metazoan Phylogeny database 
(http://www.deep-phylogeny.org/, last accessed November 
4, 2013). We only used species for which more than 1,000 
contigs were available (status: December 2011), with two ex- 
ceptions: the springtail Orchesella cincta (Collembola, 
Entomobryidae, 754 contigs) and the silverfish 
Tricholepisma aurea (Zygentoma, Lepismatidae, 344 contigs), 
the only publicly available zygentoman transcriptome assem- 
bly (supplementary table S1, Supplementary Material online). 

We generated new transcriptome data for Megajapyx sp. 
(Diplura, Japygidae) and the firebrat Thermobia domestica 
(Packard 1837) (Zygentoma, Lepismatidae) (table 1). 
Extraction of RNA, complementary deoxyribonucleic acid 
(cDNA) library construction, library normalization, and 454 
pyrosequencing of ~1,000,000 ESTs per species using the GS-FLX
Titanium System (Roche) were carried out at the Max
Planck Institute for Molecular Genetics (MPIMG), Berlin, 
Germany. Vector clipping, trimming, and soft masking of 
raw reads and assembly into contigs was conducted at the 
Center for Integrative Bioinformatics (CIBIV), Vienna, Austria. 
Steps at the MPIMG and the CIBIV were done as described in 
von Reumont et al. (2012) and Simon et al. (2012), for details 
see supplementary material (section 1; Supplementary 
Material online). Raw sequence reads were deposited at the 
National Center for Biotechnology Information (NCBI), 



Sequence Read Archive (accession numbers Megajapyx 
sp.: SRR400673; T. domestica: SRR400672). Transcriptome 
assemblies of Megajapyx sp. (accession numbers JT047774-
JT094274) and T. domestica (accession numbers JT494145-
JT533227) were deposited at the Transcriptome Shotgun
Assembly (TSA) Database, NCBI Bioproject ID PRJNA81579
and PRJNA81581 (http://www.ncbi.nlm.nih.gov/bioproject,
last accessed November 4, 2013). For submission, we excluded 
contigs shorter than 200 bp, according to the submission 
guidelines; the full transcriptome assemblies are available at 
http://zfmk.de/bioinformatics/Full_Transcriptome_ 
Assemblieszip (last accessed November 4, 2013). 

Orthology Assignment 

To identify 1:1 OGs in our transcriptome assemblies, we used 
the Hidden Markov Model based Search for Orthologs using 
Reciprocity (HaMStR) pipeline (Ebersberger et al. 2009; http:// 
www.deep-phylogeny.org/hamstr/, last accessed November 4, 
2013), version 4. As reference set for clusters of OGs, we used a 
set of 1,886 1:1 OGs (represented by amino acid sequences) 
based on five reference species (supplementary material [sec- 
tion 2] and table S2, Supplementary Material online). We 
considered orthology to be present if bidirectional best
hits were found between our transcript sequences and the reference
species Daphnia pulex, Ixodes scapularis, Apis mellifera,
and Capitella sp.
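
The bidirectional best-hit criterion can be sketched in a few lines of Python. The sketch below does not reproduce the HaMStR pipeline; it only illustrates the reciprocity test on precomputed best hits. The dictionaries, gene IDs, and contig names are hypothetical toy inputs.

```python
def reciprocal_best_hits(forward_best, reverse_best):
    """Keep reference-gene/transcript pairs that are each other's best hit.

    forward_best: reference gene -> best-scoring transcript in the assembly
    reverse_best: transcript -> best-scoring gene in the reference proteome
    """
    return {
        ref_gene: transcript
        for ref_gene, transcript in forward_best.items()
        if reverse_best.get(transcript) == ref_gene
    }

# Toy best hits in both search directions (hypothetical IDs).
forward_best = {"OG0001": "contig_12", "OG0002": "contig_40", "OG0003": "contig_7"}
reverse_best = {"contig_12": "OG0001", "contig_40": "OG0009", "contig_7": "OG0003"}

# contig_40 is rejected because its best reverse hit is a different gene.
print(reciprocal_best_hits(forward_best, reverse_best))
```

In practice the test would be repeated for each reference species, and a transcript would be accepted only under whatever reciprocity setting the pipeline is run with.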

Alignment, Alignment Masking, and Concatenation 
We aligned amino acid sequences using MAFFT L-INS-i 
(Katoh and Toh 2008) v.6.850 for each gene separately. 
Afterwards, randomly similar aligned sections were identified 
with a modified version of ALISCORE (Misof B and Misof K 
2009; Kuck et al. 2010; Meusemann et al. 2010; for modifica- 
tions, see Meusemann et al. 2010) using the following options: 
default sliding window size; -r: maximum number of pairwise 
sequence comparisons; -e: special scoring for gappy amino 
acid data. Identified randomly similar aligned sections were 
masked with ALICUT v.2.0 (Kuck 2009; www.utilities.zfmk.de, 
last accessed November 4, 2013). Masked alignments were 
concatenated into supermatrices with FASconCAT v.1.0 
(Kuck and Meusemann 2010). 
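
Concatenation of masked single-gene alignments into a supermatrix can be sketched as follows. This minimal Python example is not FASconCAT; the function name is hypothetical, and filling absent taxa with "X" is an assumption made here to show how missing data enter the supermatrix.

```python
from typing import Dict, List


def concatenate(alignments: List[Dict[str, str]], missing: str = "X") -> Dict[str, str]:
    """Concatenate per-gene alignments (taxon -> aligned sequence).

    Taxa absent from a gene are padded with the missing-data symbol so
    that every row of the supermatrix has the same length.
    """
    taxa = sorted({taxon for aln in alignments for taxon in aln})
    rows: Dict[str, List[str]] = {taxon: [] for taxon in taxa}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one gene alignment must have equal length")
        n_sites = lengths.pop()
        for taxon in taxa:
            rows[taxon].append(aln.get(taxon, missing * n_sites))
    return {taxon: "".join(parts) for taxon, parts in rows.items()}
```

The padded blocks produced here are exactly the unevenly distributed missing data whose effect on node support is examined in this study.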

Design of Decisive and Indecisive Data Sets 
We extracted all genes from the supermatrix that contain at 
least one representative of each of 1) Protura, 2) Diplura, 3) 
Collembola, and 4) remaining species to generate a decisive 
data set among entognathous lineages. The resulting data set 
is called M_Ento. 

We generated two additional data subsets from the original 
supermatrix: 1) A so-called selected optimal subset (SOS), 
generated with MARE v.0.1.2-rc (Meyer and Misof 2010; 
http://mare.zfmk.de, last accessed November 4, 2013), applying 
taxon weighting -t 7.5. This approach is analogous to that of 
Meusemann et al. (2010) and von Reumont et al. (2012). 2) 
From this SOS data set, we compiled a data set called SOS^ 
by removing all genes that were covered by all three 
entognathous lineages to obtain a maximally indecisive 
"worst case" data set in which each gene contained at most 
two entognathous lineages. 
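
The two gene-selection rules can be sketched in Python. The function names, the taxon-to-group mapping, and the convention that unlisted taxa count as "remaining" are assumptions for illustration. In the study the decisive rule was applied to the original supermatrix and the indecisive rule to the SOS data set; the sketch simply applies each rule to whatever gene collection it is given.

```python
from typing import Dict, List, Set

ENTOGNATHA = {"Protura", "Diplura", "Collembola"}
REMAINING = "remaining"


def groups_covered(taxa: Set[str], group_of: Dict[str, str]) -> Set[str]:
    """Map the taxa present in a gene to the clusters they represent."""
    return {group_of.get(taxon, REMAINING) for taxon in taxa}


def decisive_genes(genes: Dict[str, Set[str]], group_of: Dict[str, str]) -> List[str]:
    """Genes with at least one representative of all four clusters."""
    required = ENTOGNATHA | {REMAINING}
    return [g for g, taxa in genes.items() if required <= groups_covered(taxa, group_of)]


def indecisive_genes(genes: Dict[str, Set[str]], group_of: Dict[str, str]) -> List[str]:
    """Genes covering at most two of the three entognathous lineages."""
    return [
        g for g, taxa in genes.items()
        if not (ENTOGNATHA <= groups_covered(taxa, group_of))
    ]
```

A gene that lacks, for example, all proturan sequences cannot distinguish among the three competing topologies, which is why it ends up in the indecisive set.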

Four-Cluster Likelihood Mapping 
In addition to tree reconstruction with BS, we applied the 
FcLM approach (Strimmer and von Haeseler 1997) to the 
M_Ento data set. We binned sequenced species into four 
clusters: 1) Protura (1 species), 2) Diplura (2 species), 3) 
Collembola (5 species), and 4) remaining species (65 species) 
(supplementary table S1, Supplementary Material online). 
Next, we 1) estimated the tree-likeness of each gene, that is, 
the number of quartets that showed support for one of 
the three possible topologies, and 2) evaluated which of 
the three possible topologies was supported by the majority 
of those quartets (predominant support): T1 (Protura + 
Diplura) and (Collembola + remaining taxa), T2 
(Protura + Collembola) and (Diplura + remaining taxa), or 
T3 (Diplura + Collembola) and (Protura + remaining taxa) 
(fig. 5). The competing hypotheses of Meusemann et al. 
(2010) and von Reumont et al. (2012) are represented by 
either T1 (Nonoculata hypothesis) or T2 (Ellipura hypothesis); 
the third topology T3 does not represent a currently debated 
hypothesis. FcLM was conducted using TREE-PUZZLE v.5.2 
(Schmidt et al. 2002; http://www.tree-puzzle.de, last accessed 
November 4, 2013), applying the BLOSUM62 substitution 
matrix (Henikoff S and Henikoff JG 1992) because this 
matrix is also implemented in the software MARE 
(Meyer and Misof 2010; http://mare.zfmk.de, last accessed 
November 4, 2013). 
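
How a single quartet is assigned to one of the seven regions of the simplex can be sketched in Python. The sketch follows our reading of likelihood mapping in Strimmer and von Haeseler (1997): the three log-likelihoods are converted to posterior weights, and the quartet is assigned to the nearest of seven attractor points. It is not the TREE-PUZZLE code, and the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

# Attractor points of the seven regions in the simplex (weights of T1, T2, T3).
ATTRACTORS = {
    "T1": (1.0, 0.0, 0.0),
    "T2": (0.0, 1.0, 0.0),
    "T3": (0.0, 0.0, 1.0),
    "T12": (0.5, 0.5, 0.0),
    "T13": (0.5, 0.0, 0.5),
    "T23": (0.0, 0.5, 0.5),
    "T*": (1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0),
}


def posterior_weights(log_likelihoods: Sequence[float]) -> Tuple[float, ...]:
    """Convert the log-likelihoods of T1, T2, T3 to weights that sum to 1."""
    top = max(log_likelihoods)
    # Subtracting the maximum avoids numerical underflow in exp().
    raw = [math.exp(value - top) for value in log_likelihoods]
    total = sum(raw)
    return tuple(value / total for value in raw)


def voronoi_region(weights: Sequence[float]) -> str:
    """Return the label of the attractor closest to the weight vector."""
    def squared_distance(label: str) -> float:
        return sum((w - a) ** 2 for w, a in zip(weights, ATTRACTORS[label]))
    return min(ATTRACTORS, key=squared_distance)
```

A quartet with weights near a corner is counted as predominant support for T1, T2, or T3, whereas weights near an edge midpoint or the center fall into the ambiguous regions T12, T13, T23, or T*.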

For each gene in the data set M_Ento, we calculated the 
proportions of quartets that predominantly supported either 
topology T1, T2, or T3. According to the topology that was 
supported by the majority of quartets, we classified each 
gene into one of three groups, supporting Nonoculata, 
Ellipura, or Diplura + Collembola (fig. 5 and supplementary 
table S4, Supplementary Material online). Quartets for which 
the support remained ambiguous (T12, T23, T13, and T*; fig. 5) 
were not used for classification (see supplementary fig. S14 
[Supplementary Material online] for the results with all 
quartets). All classified genes (supplementary table S4, 
Supplementary Material online) were subsequently concate- 
nated into three submatrices called M_Nono (genes support- 
ing Nonoculata), M_Elli (genes supporting Ellipura), and 
M_DiCo (genes supporting Diplura + Collembola). 
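
The per-gene classification can be sketched in Python. Two points are assumptions of this sketch and are not stated in the text: "majority" is interpreted as the topology with the most unambiguous quartets, and a gene with tied counts is left unclassified. The function name is hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional

# Submatrix receiving the genes that predominantly support each topology.
SUBMATRIX = {"T1": "M_Nono", "T2": "M_Elli", "T3": "M_DiCo"}


def classify_gene(quartet_regions: Iterable[str]) -> Optional[str]:
    """Assign a gene to a submatrix from the regions of its quartets.

    Quartets in the ambiguous regions (T12, T13, T23, T*) are ignored.
    Returns None if no unambiguous quartet exists or if the top counts tie
    (tie handling is an assumption of this sketch).
    """
    counts = Counter(r for r in quartet_regions if r in SUBMATRIX)
    ranked = counts.most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return SUBMATRIX[ranked[0][0]]
```

Genes returning the same label would then be concatenated into the corresponding submatrix.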

[Fig. 5 symbol legend: Protura, Collembola, Diplura, remaining taxa] 

Fig. 5. 2D simplex graph. Voronoi cells are areas in which quartets show predominant or maximal support for one of the three topologies T1, T2, T3, 
or in which quartets show ambiguous support (T12, T13, T23, and T*). For further explanations, refer to Strimmer and von Haeseler (1997, fig. 3). Voronoi 
cell corresponding to T1 (blue): quartets show support for (Protura + Diplura) - (Collembola + remaining taxa); Voronoi cell corresponding to T2 
(red): quartets show support for (Protura + Collembola) - (Diplura + remaining taxa); Voronoi cell corresponding to T3 (yellow): quartets show 
support for (Diplura + Collembola) - (Protura + remaining taxa); Voronoi cells corresponding to T12, T13, T23 (gray) do not show clear support for T1, 
T2, and T3; in T* all topologies are equally likely. 

Phylogenetic Tree Inference 

ML trees were reconstructed from all data sets: M_Ento, 
M_Nono, M_Elli, M_DiCo, SOS, and SOS^ (discussed 
earlier). We estimated evolutionary models for each data 
set with ModelGenerator v.0.85 (Keane et al. 2006). The 
best fitting model was selected based upon the Akaike 
Information Criterion (AIC; Akaike 1974). ML trees were 
inferred with RAxML v.7.2.8-ALPHA HYBRID (Stamatakis 2006; 
Ott et al. 2007; Pfeiffer and Stamatakis 2010) using 
the CAT model of rate heterogeneity (Stamatakis 2006) and 
the LG protein substitution matrix (Le and Gascuel 2008). 
Final tree searches were conducted under the GAMMA 
model of rate heterogeneity (Yang 1996). Bootstrap analyses 
were performed with the rapid algorithm (Stamatakis 2006), 
which also included subsequent searches for the best scoring 
ML tree. We obtained BS for each node from 1,000 rapid 
bootstrap replicates, and checked a posteriori whether sufficient 
bootstrap trees had been computed using the bootstopping 
criteria (Pattengale et al. 2010, default settings). ML analyses 
were conducted on a Linux cluster at the Cologne High 
Efficient Operating Platform for Science (CHEOPS), 
Regionales Rechenzentrum Köln (RRZK), using eight nodes 
with 12 cores each. 
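
Model selection by AIC can be illustrated with a short Python sketch using the standard definition AIC = 2k - 2 ln L, where k is the number of free parameters and ln L the maximized log-likelihood. This is not ModelGenerator's code; the function names are hypothetical, and the candidate models and values in the usage note are invented solely for illustration.

```python
from typing import Dict, Tuple


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2.0 * n_params - 2.0 * log_likelihood


def best_model(candidates: Dict[str, Tuple[float, int]]) -> str:
    """Return the name of the candidate model with the lowest AIC.

    candidates: model name -> (maximized log-likelihood, number of parameters)
    """
    return min(candidates, key=lambda name: aic(*candidates[name]))
```

For example, with the invented values `{"LG": (-1000.0, 10), "LG+G": (-990.0, 11)}`, the second model is preferred because its gain in likelihood outweighs the penalty for one additional parameter.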

After tree inference, we scrutinized our trees for rogue taxa 
(Aberer et al. 2013; Aberer and Stamatakis 2011; see supplementary 
material [section 3], figs. S2 and S4, table S7, 
Supplementary Material online, for details and results). 
We removed sequences corresponding to taxa that were 
identified as rogues from the concatenated alignments 
and repeated the tree inferences. All trees were edited with 
Treegraph v.2.0 (Stover and Muller 2010), and rooted with 
Capitella sp. Data sets are deposited at Dryad: 
http://doi.org/10.5061/dryad.mk8p7 (last accessed November 4, 2013). 

Supplementary Material 

Supplementary material (sections 1-6), tables S1-S7, and fig- 
ures S1-S14 are available at Molecular Biology and Evolution 
online (http://www.mbe.oxfordjournals.org/). 